In this notebook I'd like to show how we can plot a large dataset (36 million points, in this case) on a single machine (a MacBook Air, in my case) using two new libraries: dask and datashader.
%matplotlib inline
import pylab as plt
from ipynotifyer import notifyOnComplete as nf
import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf
from dask import dataframe as dd
import dask
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, viridis, inferno
from IPython.core.display import HTML, display
from pyproj import Proj # reproject points to State Plane
nyc = Proj(init='epsg:2263')
def reproj(df, prj=nyc):
    # use the projection passed in (the original ignored prj and always used nyc)
    x, y = prj(df['lon'].values, df['lat'].values)
    # assign arrays directly: building an intermediate DataFrame would misalign
    # indices on dask partitions, which do not start from 0
    df['x'], df['y'] = x, y
    return df
Get the data
dsk = dd.read_csv('data/data*.csv', encoding='utf8')
len(dsk) # size of the dataset
Process the data
- lowercase the application column (the categorical cast is left commented out for now)
dsk = dsk.assign(application=dsk.application.str.lower()) #.astype('category'))
- reproject to NYC state plane
dsk = dsk.map_partitions(reproj)
- add daytime in seconds
dsk = dsk.assign(daytime=dsk.timestamp.mod(86400))  # 86400 seconds in a day
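A quick sanity check of the mod trick: since a day has 86400 seconds, a Unix timestamp modulo 86400 gives seconds since UTC midnight. The date below is made up just for illustration.

```python
import datetime as dt

# an arbitrary timestamp at 13:30:00 UTC
ts = int(dt.datetime(2016, 5, 1, 13, 30, 0, tzinfo=dt.timezone.utc).timestamp())

daytime = ts % 86400  # seconds since midnight
print(daytime // 3600, (daytime % 3600) // 60)  # hour and minute: 13 30
```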
Now let's play with dask's graph visualisation, just because it is awesome. As we can see, the data is split into many "chunks", and the same set of transformations is applied to each of them (all operations are row-wise so far).
dsk.visualize()
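To see what this buys us, here is a pandas-only sketch (no dask needed) of the chunked, row-wise processing model: the frame is split into partitions and the same transformation is applied to each independently, which is what `map_partitions` schedules in the graph above. The chunk size and sample values are made up for illustration.

```python
import pandas as pd

def lowercase_app(part):
    # the same row-wise transform dask would run on each partition
    return part.assign(application=part.application.str.lower())

df = pd.DataFrame({'application': ['Twitter for iPhone', 'Instagram',
                                   'Foursquare', 'Instagram']})

# split into "partitions" of 2 rows and process each one independently
chunks = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]
result = pd.concat(lowercase_app(c) for c in chunks)

print(result.application.tolist())
# ['twitter for iphone', 'instagram', 'foursquare', 'instagram']
```

Because each partition is independent, dask can run them in parallel and never needs the whole dataset in memory at once.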
And now let's actually compute the result.
d = dsk.compute()
Visualisation
Now let's prepare to visualise our map using datashader.
First, let's define a canvas size
plot_width = int(1000)
plot_height = plot_width
background = "black"
The datashader examples suggest using the partial helper, so we don't have to specify the background style every time.
export = partial(export_image, background = background)
cm = partial(colormap_select, reverse=(background!="black"))
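For readers unfamiliar with `functools.partial`: it freezes some arguments of a function and returns a new callable. A minimal stdlib illustration with a stand-in function (`save_image` and `export_demo` are hypothetical names, not datashader's API):

```python
from functools import partial

def save_image(img, name, background="white"):
    # stand-in for an export function, just to show the mechanics
    return f"saving {name} on a {background} background"

# freeze the background keyword once, instead of repeating it at every call
export_demo = partial(save_image, background="black")
print(export_demo("img", "tweets3"))  # saving tweets3 on a black background
```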
Also, we need the notebook output area to be wide.
display(HTML(""))
Now let's define our data-space canvas coordinates. We can simply reproject them from lon/lat as well.
sw = nyc( -74.15, 40.463661 ) # reproj
ne = nyc( -73.66, 40.947435 ) # reproj
NYC = x_range, y_range = zip(sw, ne)
cvs = ds.Canvas(plot_width, plot_height, *NYC)
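Under the hood, the canvas is just a regular grid over `x_range` × `y_range`: each point is mapped to a pixel by linear scaling. A simplified pure-Python sketch of that binning (datashader's actual implementation is vectorized and compiled, but the idea is the same):

```python
def to_pixel(value, lo, hi, n_bins):
    """Map a coordinate in [lo, hi] to a bin index in [0, n_bins - 1]."""
    frac = (value - lo) / (hi - lo)
    return min(int(frac * n_bins), n_bins - 1)

# a point halfway across a 1000-pixel-wide range lands in bin 500
print(to_pixel(5.0, 0.0, 10.0, 1000))   # 500
# the top edge clamps to the last bin
print(to_pixel(10.0, 0.0, 10.0, 1000))  # 999
```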
Density
First, let's just count tweets at each point.
count = cvs.points(d, 'x', 'y')
Let's start with linear interpolation. It is a bad idea 99% of the time, but let's see.
export(tf.interpolate(count, cmap = Greys9, how='linear'),'tweets_density_linear')
As expected, it doesn't really help, so let's stick with histogram equalization. This means that for each color in the colormap, the bucket boundaries are adjusted so that each color represents an equal number of points.
export(tf.interpolate(count, cmap = Greys9, how='eq_hist'),'tweets3')
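A small numpy sketch of the eq_hist idea: map each value to its rank in the sorted order, so the same number of pixels falls into each output level. This is a simplification of datashader's implementation (which interpolates a real histogram), but the principle is the same.

```python
import numpy as np

counts = np.array([1, 2, 3, 1000, 2000, 3000], dtype=float)

# linear scaling: the three small values collapse into nearly the same shade
linear = counts / counts.max()  # rounded: [0.000, 0.001, 0.001, 0.333, 0.667, 1.000]

# rank-based ("equalized") scaling: values spread evenly over the levels
ranks = counts.argsort().argsort()
eq = ranks / (len(counts) - 1)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

With heavy-tailed data like tweet counts, linear scaling wastes almost the entire colormap on a handful of extreme pixels; rank-based scaling uses every shade.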
Now, grey is kind of boring, and it is hard to pick out the real density clusters.
export(tf.interpolate(count, cmap=viridis, how='eq_hist'), 'colored_total')
Applications
Now, let's determine which of the top-4 applications is the most popular at each point.
I actually started by defining the colors. A strange place to start, but this way I can use the dict keys to filter the apps later.
if background == "black":
    color_key = {'foursquare':'aqua', 'twitter for iphone':'white', 'instagram':'red', 'twitter for android':'grey'} #, 'o':'yellow' }
else:
    color_key = {'foursquare':'blue', 'twitter for iphone':'white', 'instagram':'red', 'twitter for android':'grey'} # 'o':'saddlebrown'}
Filter the data down to the top-4 applications, just as we would with pandas.
appDf = d[d.application.isin(color_key.keys())]
Now, let's convert application to a categorical type.
appDf = appDf.assign(application=appDf.application.astype('category'))
appDf.application.value_counts()
Now count by category
appCount = cvs.points(appDf, 'x', 'y', ds.count_cat('application'))
And plot
export(tf.colorize(appCount, color_key, how='eq_hist'), 'colored_apps')
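What `count_cat` produces is a per-pixel count for each category; `colorize` then blends the category colors in proportion to those counts. A toy numpy version of the per-bin, per-category counting (the bin and category indices here are made up for illustration):

```python
import numpy as np

n_bins, n_cats = 4, 2
bin_idx = np.array([0, 0, 1, 3, 3, 3])  # which pixel each point fell into
cat_idx = np.array([0, 1, 0, 1, 1, 0])  # which application each point used

agg = np.zeros((n_bins, n_cats), dtype=int)
np.add.at(agg, (bin_idx, cat_idx), 1)  # unbuffered scatter-add

print(agg)
# [[1 1]
#  [1 0]
#  [0 0]
#  [1 2]]
```

Each row is one pixel; the column with the largest count decides which application dominates there.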
Daytime
Now, let's visualise the time of day. Here I use the "hsv" colormap, as I want values for 00:05 and 23:55 to map to similar colors.
I also remove noise (points with fewer than 10 tweets), using the count aggregate we already computed.
threshold = 10
aggDaytime = cvs.points(d, 'x', 'y', agg=ds.mean('daytime'))
colormap = plt.get_cmap('hsv')
export(tf.interpolate(aggDaytime.where(count > threshold), cmap=colormap, how='eq_hist'), 'colored_daytime')
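The reason for a cyclic colormap: time of day wraps around midnight, so 00:05 and 23:55 should look alike even though their raw second counts are far apart. A small sketch of the circular distance in seconds (plain Python, no plotting):

```python
def circular_diff(a, b, period=86400):
    """Smallest distance between two times of day, wrapping at midnight."""
    d = abs(a - b) % period
    return min(d, period - d)

t1 = 0 * 3600 + 5 * 60    # 00:05 -> 300 seconds
t2 = 23 * 3600 + 55 * 60  # 23:55 -> 86100 seconds
print(circular_diff(t1, t2))  # 600 seconds: only 10 minutes apart
```

"hsv" respects exactly this wrap-around, because its first and last colors are (nearly) the same hue.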
And that is the end of the notebook.